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Abstract. Density modeling is notoriously difficult for high dimensional 
data. One approach to the problem is to search for a lower dimensional 
manifold which captures the main characteristics of the data. Recently, 
the Gaussian Process Latent Variable Model (GPLVM) has successfully 
been used to find low dimensional manifolds in a variety of complex 
data. The GPLVM consists of a set of points in a low dimensional latent 
space, and a stochastic map to the observed space. We show how it can 
be interpreted as a density model in the observed space. However, the 
GPLVM is not trained as a density model and therefore yields bad den- 
sity estimates. We propose a new training strategy and obtain improved 
generalisation performance and better density estimates in comparative 
evaluations on several benchmark data sets. 

Modeling of densities, aka unsupervised learning, is one of the central prob- 
lems in machine learning. Despite its long history [1], density modeling remains 
a challenging task especially in high dimensional spaces. For example, the gen- 
erative approach to classification requires density models for each class, and 
training such models well is generally considered more difficult than the alterna- 
tive discriminative approach. Classical approaches to density modeling include 
both parametric and non parametric methods. In general, simple parametric 
approaches have limited utility, as the assumptions might be too restrictive. 
Mixture models, typically trained using the EM algorithm, are more flexible, 
but e.g. Gaussian mixture models are hard to fit in high dimensions, as each 
component is either diagonal or has in the order of parameters, although the 
mixtures of Factor Analyzers algorithm [1] may be able to strike a good balance. 
Methods based on kernel density estimation [^5,4] are another approach, where 
bandwidths may be set using cross validation [.'>]. 

The methods mentioned so far have two main shortcomings: 1) they typically 
do not perform well in high dimensions, and 2) they do not provide an intuitive or 
generative understanding of the data. Generally, we can only succeed if the data 
has some regular structure, the model can discover and exploit. One attempt 
to do this is to assume that the data points in the high dimensional space lie 
on - or close to - some smooth underlying lower dimensional manifold. Models 
based on this idea can be divided into models based on implicit or explicit 
representations of the manifold. An implicit representation is used by [G] in a 
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non-parametric Gaussian mixture with adaptive covariance to every data point. 
Explicit representations are used in the Generative Topographic Map [ ] and 
by [8]. Within the explicit camp, models contain two separate parts, a lower 
dimensional latent space equipped with a density, and a function which maps 
points from the low dimensional latent space to the high dimensional space 
where the observations lie. Advantages of this type of model include the ability 
to understand the structure of the data in a more intuitive way using the latent 
representation, as well as the technical advantage that the density in the observed 
space is automatically properly normalised by construction. 

The Gaussian Process Latent Variable Model (GPLVM) [ ] uses a Gaussian 
process (GP) [10] to define a (stochastic) map between a lower dimensional latent 
space and the observation space. However, the GPLVM does not include a density 
in the latent space. In this paper, we explore extensions to the GPLVM based on 
densities in the latent space. One might assume that this can trivially be done, 
by thinking of the latent points learnt by the GPLVM as representing a mixture 
of delta functions in the latent space. Since the GP based map is stochastic, it 
induces a proper mixture in the observed space. However, this formulation is 
unsatisfactory, because the resulting model is not trained as a density model. 
Consequently, our experiments show poor density estimation performance. 

Mixtures of Gaussians form the basis of the vast majority of density estima- 
tion algorithms. Whereas kernel smoothing techniques can be seen as introducing 
a mixture component for each data point, infinite mixture models [J i] explore 
the limit as the number of components increases and mixtures of factor analysers 
impose constraints on the covariance of individual components. The algorithm 
presented in this paper can be understood as a method for stitching together 
Gaussian mixture components in a way reminiscent of [ ] using the GPLVM 
map from the lower dimensional manifold to induce factor analysis like con- 
straints in the observation space. In a nutshell, we propose a density model in 
high dimensions by transforming a set of low-dimensional Gaussians with a GP. 

We begin by a short introduction to the GPLVM and show how it can be 
used to define density models. In section 2, we introduce a principled learning 
algorithm, and experimentally evaluate our approach in section 3. 

1 The GPLVM as a Density Model 

A GP / is a probabilistic map parametrised by a covariance fc(x, x') and a 
mean to(x). We use to(x) = and automatic relevance determination (ARD) 
fc(x*, x-') = aj exp (— i(x* — x^)^W^^(x* — x^)) -I- S^ja'^ in the following. Here, 
(Ty and af^ denote the signal and noise variance, respectively and the diagonal 
matrix W contains the squared length scales. Since a GP is a distribution over 
functions f : X ^ Z, the output z = /(x) is random even though the input x 
is deterministic. In GP regression, a GP prior is combined with training data 
{x',z*}igi^{i jv} into a GP posterior conditioned on the training data with 
mean /Lt*(x*) = a^k* and covariance fc(x»,x'^) = fc(x*,x'^) — k7K~^k'^ where 
k* = [A:(xi,x*),..,fc(x^,x,)]T, K = [fc(x%xJ)]y=i..jv, = [fc(x% x^)]y=i..Ar 
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Fig. 1. The high dimensional {D — 2) density of the data points z* in panel 
(c) is modelled by a mixture of Gaussians P(z) = A/'(z|/Xxi, -S'xi) shown in 

panel (b,d) where the means and variances S^i are given by the predictive 
means and covariances of a set of independent Gaussian processes fj-.X^ 
Z conditioned on low-dimensional (d = 1) latent locations x\ A latent Dirac 
mixture (a) yields a spherical Gaussian mixture with varying widths (b) and a 
latent Gaussian mixture (e) results in a fully coupled mixture model (d) smoothly 
sharing covariances across mixture components. 

and OL^ — [z^, .., z^]K^^. Deterministic inputs x lead to Gaussian outputs and 
Gaussian inputs lead to non-Gaussian outputs whose first two moments can 
be computed analytically [12] for ARD covariance. Multivariate deterministic 
inputs X lead to spherical Gaussian outputs z and Gaussian inputs x lead to 
non-Gaussian outputs z whose moments {ii^ , ) are given by: 



X ~ 5(x.) z ~ J^(ii,, all) 


X ~ A^(x., V,) z ~* TV'l/i., S,) 


fi, = A^k, 


al = k„ - k^K-ifc. e [al.al + a}] 




S, = (^k„ - tr(K-iK,)j I + A^(K, - k.kJjA 



Here, = [a.i, .., ajj]^ = [z^, .., z^]K^^ and the quantities k» = E[k] and 

= E[kk^] denote expectations of k = k(x) = [k{'x.^,x), ..,fc(x^,x)]^ w.r.t. 
the Gaussian input distribution A/'(x|x*, Vx) that can readily be evaluated in 
closed form [12] as detailed in the Appendix. In the limit of Vx — > we recover 
the deterministic case as k* — k*, K* — >• k*kj and all. Non-zero 

input variance Vx results in full non-spherical output covariance -57*, even for 
independent GPs because all the GPs are driven by the same (uncertain) input. 

A GPLVM [9] is a successful and popular non-parametric Bayesian tool for 
high dimensional nonlinear data modeling taking into account the data's mani- 
fold structure based on a low-dimensional representation. High dimensional data 
points z* e Z C MP , Z = [z^, . . . ,z^], are represented by corresponding latent 
points X = [x^, . . . ,x^] from a low-dimensional latent space X (ZW^ mapped 
into Z by D independent GPs fj - one for each component Zj of the data. All the 
GPs fj are conditioned on X and share the same covariance and mean functions. 
The model is trained by maximising the sum of the log marginal likelihoods over 
the D independent regression problems with respect to the latent points X. 

While most often applied to nonlinear dimensionality reduction, the GPLVM 
can also be used as a tractable and flexible density model in high dimensional 
spaces as illustrated in Figure 1. The basic idea is to interpret the latent points X 



4 



Hannes Nickisch and Carl E. Rasmussen 



as centres of a mixture of either Dirac (Figure la) or Gaussian (Figure le) distri- 
butions in the latent space X that are projected forward by the GP to produce a 
high dimensional Gaussian mixture P(z) — A/'(z|/Xxm ^xO in the observed 

space Z. Depending on the kind of latent mixture, the density model P(z) will 
either be a mixture of spherical (Figure lb) or full-covariance Gaussians (Fig- 
ure Id). By that mechanism, we get a tractable high dimensional density model 
P(z): A set of low-dimensional coordinates in conjunction with a probabilistic 
map / yield a mixture of high dimensional Gaussians whose covariance matrices 
are smoothly shared between components. As shown in Figure 1(d), the model 
is able to capture high dimensional covariance structure along the data manifold 
by relatively few parameters (compared to D^), namely the latent coordinates 
X e R''^^ and the hyperparameters = [W, ct/, ct^, Vx] € 1R+ +^ of the GP. 

The role of the latent coordinates X is twofold: they both define the GP, 
mapping the latent points into the observed space, and they serve as centres 
of the mixture density in the latent space. If the latent density is a mixture of 
Gaussians, the centres of these Gaussians are used to define the GP map, but 
the full Gaussians (with covariance Vx) are projected forward by the GP map. 

2 Learning Algorithm 

Learning or model fitting is done by minimising a loss function L w.r.t. the latent 
coordinates X and the hyperparameters 9. In the following, we will discuss the 
usual GPLVM objective function, make clear that it is not suited for density 
estimation and use leave-out estimation to avoid overfitting. 

2.1 GPLVM likelihood 

A GPLVM [ ] is trained by setting the latent coordinates X and the hyperpa- 
rameters 9 to maximise the probability of the data 



that is the product of the marginal likelihoods of D independent regression 
problems. Using ^ = iK^^(Z^Z— £'K)K^^, conjugate gradients optimisation 
at a cost of 0{DN^) per step is straightforward but suffers from local optima. 

However, optimisation of Lzp^,9) does not encourage the GPLVM to be 
a good density model. Only indirectly, we expect the predictive variance to be 
small (implying high density) in regions supported by many data points. The 
main focus of ^^(X, 9) is on faithfully predicting Z from X (as implemented by 
the fidelity trace term) while using a relatively smooth function (as favoured by 
the log determinant term). Therefore, we propose a different cost function. 

2.2 General leave-out estimators 

Density estimation [lo] constructs parametrised estimators Pe(z) from iid data 
~ P(z). We use the Kullback-Leibler divergence J (9) = -/P(z)lnPe(z)dz 




(1) 
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to the underlying density and its empirical estimate Je{0) — — ^^i^j^n-Feji^^) 
as quality measure where / emphasises that the full dataset has been used for 
training. This estimator, is prone to overfitting if used to adjust the param- 
eters via 6* = argmine je(0). Therefore, estimators based on K subsets of 
the data Jv{9) = Efc=i X^j^/^ l^'P'ejfc (z*), Ik C I are used. Two well 
known instances are ^-fold cross-validation (CV) and leave-one-out (LOO) es- 
timation. The subsets for CV are Ik H Ik' = 0,/ = UfcLi^fe: l^fel ~ l^fc'l ^^'^ 
K = N, Ik = I\{k} for LOO. Both of them can be used to optimise 9. 

2.3 GPLVM leave-one-out density 

There are two reasons why training a GPLVM with the log likelihood of the data 
Lz(^,0) (Eq. 1) is not optimal in the setting of density estimation; Firstly, it 
treats the task as regression, and doesn't explicitly worry about how the density 
is spread in the observation space. Secondly, our empirical results (see Section 3) 
indicate, that the test set performance is simply not good. Therefore, we propose 
to train the model using the leave-out density 

N N 

- LLooi^-e) = InHP^^K) = 5]ln^^^AA(z>,,,i:,,) . (2) 

■i— 1 i— 1 j^i 

This objective is very different from the GPLVM criterion as it measures how 
well a data point is explained under the mixture models resulting from projecting 
each of the latent mixture components forward; the leave-out aspect enforces that 
the point z* gets assigned a high density even though the mixture component 
A/" (z*|/ixi J -S'xi) has been removed from the mixture. The leave-one-out idea is 
trivial to apply in a mixture setting by just removing the contribution in the sum 
over components, and is motivated by the desire to avoid overfitting. Evaluation 
of Lloo(X, 6) requires 0{DN^) assuming N > D > d. 

However, removing the mixture component is not enough since the latent 
point X* is still present in the GP. Using rank one updates to compute inverses 
and determinants of covariance matrices K^; with row and column i removed, 
it is possible to evaluate Eq. 3 for mixture components A/" (z'|/x^], with 
latent point x* removed from the mean prediction /i^* - which is what we do 
in the experiments. Unfortunately, going further by removing x* also from the 
covariance S-^j increases the computational burden to 0{DN'^) because we need 
to compute rank one corrections to all matrices K^, £ — 1..N. Since S^j is only 
slightly smaller than J^xJ , we refrain from computing it in the experiments. 

In the original GPLVM, there is a clear one-to-one relationship between latent 
points X* and data points z' - they are inextricably tied together. However, 
the leave-one-out (LOO) density Lioo(X, 0) does not impose any constraint of 
that sort. The number of mixture components does not need to be A^, in fact 
we can choose any number we like. Only the data visible to the GP {x-', z-'} 
is tied together. The actual latent mixture centres X are not necessarily in 
correspondence with any actual data point z*. However, we can choose Z to be 
a subset of Z. This is reasonable because any mixture centre /i.xJ — ZK~^k(x^ ) 
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(corresponding to the latent centre x-') lies in the span of Z, hence Z should 
approximately span Z. In our experiments, we enforce Z = Z. 

2.4 Overfitting avoidance 

Overfitting in density estimation means that very high densities are assigned to 
training points, whereas very low densities remain for the test points. Despite 
its success in parametric models, the leave-one-out idea alone, is not sufficient 
to prevent overfitting in our model. When optimising LlooO^:^) w.r.t. (X, 0) 
using conjugate gradients, we observe the following behaviour: The model cir- 
cumvents the LOO objective by arranging the latent centres in pairs that take 
care of each other. More generally, the model partitions the data Z C into 
groups of points lying in a subspace of dimension < D — 1 and adjusts (X, 6) 
such that it produces a Gaussian with very small variance in the orthogonal 
complement of that subspace. By scaling aj_ to tiny values, LlooP^, d) can be 
made almost arbitrarily large. It is understood that the hyperparameters of the 
underlying GP take very extreme values: the noise variance tr^ and some length 
scales Wi become tiny. In Lz(X,0), this is penalised by the ln|K| term, but 
Lloo{^i0) is happy with very improbable GPs. In our initial experiments, we 
observed this "cheating behaviour" on several of datasets. 

We conclude that even though the LOO objective (Eq. 3) is the standard 
tool to set KDE kernel widths [13], it breaks down for too complex models. We 
counterbalance this behaviour by leaving out not only one point but rather P 
points at a time. This renders cheating tremendously difficult. In our experiments 
we use the leave- P-out (LPO) objective 

LiPO (X, 0) = - 5] In E (^^ 1 ' ) . (3) 

Ideally, one would sum over all K = (p) subsets h € / of size = N — P. 
However, the number of terms K soon becomes huge: K « for P ^ N. 
Therefore, we use an approximation where we set K — N and Ik contains the 
indices j that currently have the smallest value N (z'^|/i^], S-x_3)- 

All gradients and ^^qq" can be computed in 0{DN^) when using 

/.t^]. However, the expressions take several pages. We use a conjugate gradient 
optimiser to find the best parameters X and 0. 

3 Experiments 

In the experimental section, we show that the GPLVM trained with Lz (X, 6) 
(Eq. 1) does not lead to a good density model in general. Using our L^po training 
procedure (Section 2.4, Eq. 3), we can turn it into a competitive density model. 
We demonstrate that a latent variance Vx >- improves the results even further 
in some cases and that on some datasets, our density model training procedure 
performs better than all the baselines. 
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3.1 Datasets and baselines 

We consider 9 data sets^, frequently used in machine learning. The data sets 
differ in their domain of application, their dimension D, their number of instances 
A'^ and come from regression and classification. In our experiments, we do not 
use the labels. 



dataset 


breast 


crabs 


diabetes 


ionosphere 


sonar 


usps 


abalone 


body fat 


housing 


N,D 


449,9 


200,6 


768,8 


351,33 


208,60 


9298,256 


4177,8 


252,14 


506,13 



We do not only want to demonstrate that our training procedure yields better 
test densities for the GPLVM. We are rather interested in a fair assessment of how 
competitive the GPLVM is in density estimation compared to other techniques. 
As baseline methods, we concentrate on three standard algorithms: penalised 
fitting of a mixture of full Gaussians (gm), kernel density estimation (kde) and 
manifold Parzen windows [(>] (mp). We run these algorithms for three different 
type of preprocessing: raw data (r), data scaled to unit variance (s) and whitened 
data (w) . We explored a large number of parameter settings and report the best 
results in Table 1. 

Penalised Gaussian mixtures In order to speed up EM computations, we 
partition the dataset into K disjoint subsets using the means algorithm'^. We 
fitted a penalised Gaussian to each subset and combined them using the relative 
cluster size as weight P(z) ~ jfJ2k ^k^ki^)- Every single Gaussian Pfc(z) has 
the form Pa;(z) = A/'(z|m'', C*"' + wl) where m'^ and C*^ equal the sample mean 
and covariance of the particular cluster, respectively. The global ridge parameter 
w prevents singular covariances and is chosen to maximise the LOO log density 
-L{w) = InU^V^jiz.^) = InlljEfc^feA/'lz^lm^i.C^j +^1). We use simple 
gradient descent to find the best parameter w S M+. 

Diagonal Gaussian KDE The kernel density estimation procedure fits a mix- 
ture model by centring one mixture component at each data point z*. We use 
independent multi-variate Gaussians: P(z) = A/'(z|z*, W), where the di- 

agonal widths W = 'Dg{wi, ..,wd) are chosen to maximise the LOO density 
-i(W) = InUjV^ji^^) = InOj wEi^^j A/'Cz^lz'^W). We employ a Newton- 
scheme to find the best parameters W G M.^. 

Manifold Parzen windows The manifold Parzen window estimator [ -] tries to 
capture locality by means of a kernel /c. It is a mixture of A'' full Gaussians where 

the covariance = wl+ {J2j^^Hz\'2'^)i2.' - z^){z' - z^)'^)/{J2j^^H'^\^^)) of 
each mixture component is only computed based on neighbouring data points. 

^ http : //www . csie .ntu. edu. tw/~c jlin/libsvmtools/datasets/ 
^ http : / / cseweb .ucsd . edu/~elkan/f astkmeans .html 
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dataseL 


breast 




diabetes 


ionosphere 




usps 




bodyf at 




N„. = 50 


-9.1 gm(U),s) 


0.9 gm(5,r) 


-11.0 gm(4,r) 


-34.1 gm(10,r) 


-67.7 gm(l,r) 


18.4 gin(l,r) 


12.5 gm(8,r) 


-36.0 gm(l,w) 


-33.4 gin(6,a) 


N,r = 100 


-8.6 gm{4,r) 


1.9 gm(7,r) 


-10.0 gm(3,w) 


-30.5 gm(13,r) 


-62.0 gm(l,r) 


124.8 gm(l,r) 


13.9 gm(5,r) 


-35.2 gm(2,w) 


-30.6 mp(6,21,B) 


JV„ = 150 


-8.4 gm{9,s) 




-9.6 gm(3,w) 


-33.9 gm(6,s) 


-61.5 gm(4,w) 


185.4 gm(l,r) 


14.3 gm(10,r) 


-34.7 gm(5,w) 


-29.1 mp(6,32,s) 


N,r = 200 


-8.2 gm{9,r) 




-8.3 gm(4,w) 


-31.8 gm(6,w) 




232.6 gm(3,w) 


14.2 gm(13,r) 




-23.5 gm{4,s) 


JV„ = 250 


-8.1 gm(ll,r) 




-8.2 gm(5,w) 






261.6 gm(6,w) 


14.3 gm(13,r) 




-16.0 gm(3,w) 



Table 1. Average log test densities over 10 random splits of the data. We did 
not allow Ntr to exceeded N/2. We only show the method yielding the highest 
test density among the three baseline candidates penalised full Gaussian mix- 
ture gm(K,p), diagonal Gaussian kernel density estimation kde(p) and manifold 
Parzen windows mp(c?,r,p). The parameter K = {1, .., 13} is the number of cluster 
centers used for gm, d = \D ■ {5, 12, 19, 26, 33, 40}/100] is the number of latent 
dimensions and r = \N ■ {5, 10, 15, 20, 25, 30}/100] the neighbourhood size for 
mp and p is saying which preprocessing has been used (raw r, scaled to unit vari- 
ance s, whitened w). The Gaussian mixture model yields in all cases the highest 
test density except for one case where the Parzen window estimator performs 
better. 



As proposed by the authors, we use the r-nearest neighbour kernel and do not 
store full covariance matrices but a low rank approximation Ki wl + VV^ 
with V G K^^''. As in the other baselines, the ridge parameter w is set to 
maximise the LOO density. 

Baseline results The results of the baseline density estimators can be found in 
Table 1. They clearly show three things: (i) More data yields better performance, 
(ii) penalised mixture of Gaussians is clearly and consistently the best method 
and (iii) manifold Parzen windows [ ] offer only little benefit. The absolute values 
can only be compared within datasets since linearly transforming the data Z by 
P results in a constant offset In |P| in the log test probabilities. 

3.2 Experimental setting and results 

We keep the experimental schedule and setting of the previous Section in terms of 
the 9 datasets, the 10 fold averaging procedure and the maximal training set size 
Ntr = N/2. We use the GPLVM log likehhood of the data ^^(X,©), the LPO 
log density with deterministic latent centres (LiP(9-det(X, 0), Vx = 0) and 
the LPO log density using a Gaussian latent centres LiPo~rd(X, 6) to optimise 
the latent centres X and the hyperparameters 0. Our numerical results include 3 
different latent dimensions d, 3 preprocessing procedures and 5 different numbers 
of leave-out points P. Optimisation is done using 600 conjugate gradient steps 
alternating between X and 9. In order to compress the big amount of numbers, 
we report the method with highest test density as shown in Figure 2, only. 

The most obvious conclusion, we can draw from the numerical experiments, 
is the bad performance of ^^(X, 0) as a training procedure for GPLVM in the 
context of density modeling. This finding is consistent over all datasets and 
numbers of training points. We get another conclusive result in terms of how 
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breast, N-449, D=9 



crabs, N-200, D-6_ 



diabetes, N=768, D-8 




50 100 150 200 224 

ionosphere, N-351, D=33 




100 150 175 

abalone, N-4177, D-8 




baseline "^a^ 



-baseline 
4) MFA 




sonar, N-208, D-60 




50 104 

bodyfat, N=252, D=14 




50 100 150 200 250 
usps, N.9298, D.256 




100 150 
housing, N=506, D-13 




Fig. 2. Each panel displays the log test density averaged over 10 random splits 
for three different GPLVM training procedures and the best out of 41 baselines 
(penalised mixture k = 1..13, diag.+isotropic KDE, manifold Parzen windows 
with 36 different parameter settings) as well as various mixture of factor analy- 
sers (MFA) settings as a function of the number of training data points Ntr- We 
report the maximum value across latent dimension d = {1,2,3}, three prepro- 
cessing methods (raw, scaled to unit variance, whitened) and P = {1, 2, 5, 10, 15} 
leave-out points . The GPLVM training procedures are the following: L^po-rd: 
stochastic leave- P-out density (Eq. 3 with latent Gaussians, Vx >- 0), Lipo-det: 
deterministic leave-P-out density (Eq. 3 with latent Diracs, Vx = 0) and Lz- 
marginal likelihood (Eq. 1). 



the latent variance Vx influences the final test densities'^ Only in the bodyfat 
data set it is not beneficial to allow for latent variance. It is clear that this is 
an intrinsic property of the dataset itself, whether it prefers to be modelled by 
a spherical Gaussian mixture or by a full Gaussian mixture. 

An important issue, namely how well a fancy density model performs com- 
pared to very simple models, has in the literature either been ignored [7,8] or 
only done in a very limited way ['']. Experimentally, we can conclude that on 
some datasets e.g. diabetes, sonar, abalone our procedure cannot compete 
with a plain gm model. However note, that the baseline numbers were obtained 
as the maximum over a wide (41 in total) range of parameters and methods. 

For example, in the usps case, our elaborate density estimation procedure 
outperforms a single penalised Gaussian only for training set sizes Ntr > 100. 
However, the margin in terms of density is quite big: On Ntr = 150 prewhitened 
data points LlpoO^, ^) ^ith deterministic latents yields 70.47 at = 2, whereas 
full LLPo(X.,d) reaches 207 at d = 4 which is significantly above 185.4 as ob- 
tained by the gm method - since we work on a logarithmic scale, this corresponds 
to factor of 2.4 • 10^ in terms of density. 



^ In principle, Vx could be fixed to I because its scale can be modelled by X. 
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3.3 Running times 

While the baseline methods such as gm , kde and mp run in a couple of minutes 
for the usps dataset, training a GPLVM with either Llpo(X, 0), Vx = or 
^^(X, 6) takes considerably longer since a lot of cubic covariance matrix opera- 
tions need to be computed during the joint optimisation of (X, 6). The GPLVM 
computations scale cubically in the number of data points Ntr used by the GP 
forward map and quadratically in the dimension of the observed space D. The 
major computational gap is the transition from Vx = to Vx >- because in 
the latter case, covariance matrices of size D'^ have to be evaluated which cause 
the optimisation to last in the order of a couple of hours. To provide concrete 
timing results, we picked Ntr = 150, d — 2, averaged over the 9 datasets and 
show times relative to Lz- 



alg 


gm(l) 


gm(lO) 


kde 


mp 


mf a 


Lz 




LLPO-rd 




0.27 


0.87 


0.93 


1.38 


0.30 


1.00 


35.39 


343.37 



Note that the methods L^po are run in a conservative fail-proof black box 
mode with 600 gradient steps. We observe good densities after considerably less 
gradient steps already. Another straightforward speedup can be obtained by 
carefully pruning the number of inputs to the L^po models. 

4 Conclusion and Discussion 

We have discussed how the basic GPLVM is not in itself a good density model, 
and results on several datasets have shown, that it does not generalise well. We 
have discussed two alternatives based on explicitly projecting forward a mixture 
model from the latent space. Experiments show that such density models are 
generally superior to the simple GPLVM. 

Among the two alternative ways of defining the latent densities, the simplest 
is a mixture of delta functions, which - due to the stochasticity of the GP map 
- results in a smooth predictive distribution. However, the resulting mixture of 
Gaussians, has only axis aligned components. If instead the latent distribution 
is a mixture of Gaussians, the dimensions of the observations become correlated. 
This allows the learnt densities to faithfully follow the underlying manifold. 

Although the presented model has attractive properties, some problems re- 
main: The learning algorithm needs a good initialisation and the computational 
demand of the method is considerable. However, we have pointed out that in 
contrast to the GPLVM, the number of latent points need not match the number 
of observations allowing for alternative sparse methods. 

We have detailed how to adapt ideas based on the GPLVM to density mod- 
eling in high dimensions and have shown that such models are feasible to train. 

References 

1. Izenman, A.J.: Recent developments in nonparametric density estimation. Journal 
of the American Statistical Association 86 (1991) 205-224 1 



Gaussian Mixture Modeling with GPLVMs 



11 



2. Ghahramani, Z., Beal, M.J.: Variational inference for Bayesian mixtures of factor 
analysers. In: NIPS 12. (2000) 1 

3. Rosenblatt, M.: Remarks on some nonparametric estimates of a density function. 
Annals of Mathematical Statistics 27(3) (1956) 832-837 1 

4. Parzen, E.: On estimation of a probability density function and mode. Annals of 
Mathematical Statistics 33(3) (1962) 1065-1076 1 

5. Rudemo, M.; Empirical choice of histograms and kernel density estimators. Scan- 
dinavian Journal of Statistics 9 (1982) 65-78 1 

6. Vincent, P., Bengio, Y.: Manifold parzen windows. In: NIPS 15. (2003) 1, 7, 8, 9 

7. Bishop, C.M., Svensen, M., Williams, C.K.I. : The generative topographic mapping. 
Neural Computation 1 (1998) 215-234 2, 9 

8. Roweis, S., Saul, L.K., Hinton, G.E.: Global coordination of local linear models. 
In: NIPS 14. (2002) 2, 9 

9. Lawrence, N.: Probabilistic non-linear principal component analysis with Gaussian 
process latent variable models. JMLR 6 (2005) 1783-1816 2, 3, 4 

10. Rasmussen, C.E., Williams, C.K.I. : Gaussian Processes for Machine Learning. The 
MIT Press, Cambridge, MA (2006) 2 

11. Rasmussen, C.E.: The infinite Gaussian mixture model. In: NIPS 12. (2000) 2 

12. Quiiionero-Candela, J., Girard, A., Rasmussen, C.E.: Prediction at an uncertain 
input for CPs and RVMs. Technical Report IMM-2003-18, TU Denmark (2003) 3 

13. Wasserman, L.: All of Nonparametric Statistics. Springer (2006) 4, 6 



Appendix Both K, = E[kk^] = [fc*(x% x-')]„- and k* = E[k] = [k{x.^)]j are 
the following expectations of k = [fc(x, x^), .., fc(x, x-*^)]^ w.r.t. A/'(x|x*, Vx): 



aj I VxW-i + I| ' p (Vx + W, x' - X,) , p(D, y) 




